The following packages are required for the analysis and visualization on this page:

# General data wrangling
library(tidyverse)
library(DT)
library(readxl)

# Visualization
library(plotly)
library(colorRamps)
library(RColorBrewer)
library(colorspace)
library(NeuralNetTools)
library(gridExtra)
library(ggplotify)
library(knitr)
library(kableExtra)
library(ggseg)

# Modeling
library(glmnet)
library(caret)
library(ranger)

# Ensemble building
library(caretEnsemble)

Load the models constructed in Modeling:

load("../Model_Results.RData")

I’ll evaluate model performance across the four ROI aggregation types.

Here, I will define “model performance” as the RMSE and R\(^2\) of the agreement between predicted and actual CDR-Sum of Boxes change per year in the test dataset. Seven models are evaluated: four individual models and three stacked ensembles.
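For reference, these metrics follow caret’s postResample() conventions: RMSE is the root mean squared prediction error, and R\(^2\) is the squared Pearson correlation between predictions and observations. A minimal sketch (the function names here are mine, for illustration only):

```r
# Illustrative definitions of the two metrics, matching caret's conventions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
r2   <- function(actual, predicted) cor(actual, predicted)^2
```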

model.metrics.OG$ROI_Type <- "Original"
model.metrics.PCA$ROI_Type <- "PCA"
model.metrics.BK$ROI_Type <- "Braak-stage"
model.metrics.CTX$ROI_Type <- "Cortical-lobe"
model.metrics.full <- do.call(plyr::rbind.fill, list(model.metrics.OG,
                                                     model.metrics.PCA,
                                                     model.metrics.BK,
                                                     model.metrics.CTX))

I’ll use bar plots to visualize the RMSE and R\(^2\) metrics for each of the seven models:

rmse.pal <- c("#C0D3D9", "#7DBCDD", "#2D69A5", "#2A486C")
r2.pal <- c("#fcb9b2", "#f67e7d", "#843b62", "#621940")

p.rmse <- model.metrics.full %>%
  filter(Data=="Test", Metric=="RMSE") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Model, y=Value, fill=ROI_Type)) +
  geom_bar(stat="identity", position=position_dodge()) +
  labs(fill="ROI Type") +
  theme_minimal() +
  ggtitle("RMSE Between Predicted vs. Actual CDR-SoB") +
  scale_fill_manual(values=rmse.pal) +
  ylab("RMSE") +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        plot.title=element_text(hjust=0.5),
        strip.text = element_text(size=12, face="bold"),
        legend.position = "bottom")
ggplotly(p.rmse, width=900, height=300)

p.r2 <- model.metrics.full %>%
  filter(Data=="Test", Metric=="R2") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Model, y=Value, fill=ROI_Type)) +
  geom_bar(stat="identity", position=position_dodge()) +
  labs(fill="ROI Type") +
  theme_minimal() +
  ggtitle("R2 Between Predicted vs. Actual CDR-SoB") +
  scale_fill_manual(values=r2.pal) +
  ylab("R2") +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        plot.title=element_text(hjust=0.5),
        strip.text = element_text(size=12, face="bold"),
        legend.position = "bottom")
ggplotly(p.r2, width=900, height=300)

The RMSE doesn’t afford much distinction either by ROI averaging type or by predictive model. If anything, the RMSE stands out as somewhat larger for the original ROI configuration in the random forest-stacked ensemble (ensemble.rf) and the neural network (neural.net). However, the R\(^2\) agreement between predicted and actual CDR-SoB does differentiate among the models. The random forest model (random.forest) shows the highest R\(^2\) value for every ROI configuration, surpassing all three stacked ensembles. The PCA-transformed data shows the largest R\(^2\) agreement of the four ROI configurations, followed closely by the original data configuration; however, even these values are <0.2, suggesting minimal association between the predictions and the actual values.

At the end of the day, none of the models and corresponding ROI configurations offered strong predictions of CDR-Sum of Boxes annual change based on regional changes in tau-PET tracer uptake. To improve on this in the future, I would consider including the baseline tau-PET SUVR value in a mixed-effects model; conceivably, an increase of 0.5 SUVR could have different implications for a subject with no tau-PET uptake at baseline than for a subject with high tau-PET uptake at baseline. It’s also entirely possible that there is a temporal lag between tau-PET uptake and cognitive decline; for example, tau accumulation in a given ROI (e.g. the hippocampus) may precede cognitive decline by months or years, which this model would not detect. This is a complex pathophysiological dynamic that calls for correspondingly complex modeling supported by extensive domain knowledge.
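As a sketch of that mixed-effects idea (the data frame and column names below are hypothetical, not from this analysis), baseline SUVR could enter as a moderator of the SUVR-change effect, with a random intercept per subject:

```r
# Hypothetical sketch only -- object and column names are illustrative
library(lme4)

# The interaction term lets the effect of a given SUVR change depend on
# where the subject started, i.e. a 0.5-SUVR increase from a low baseline
# can carry a different slope than the same increase from a high baseline.
fit <- lmer(CDRSB_change ~ SUVR_change * SUVR_baseline + Age_Baseline + Sex +
              (1 | Subject_ID),
            data = tau.long)
```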

That being said, I am interested in a yet-unexamined facet of this project: how does a given region of interest influence a given model? How important is, say, the hippocampus relative to the insula?

To answer these questions, I will use the varImp function from caret, which computes the relative importance of each input variable. Since variable importance is not readily extractable from caretStack ensembles, I’ll focus exclusively on the individual models. Variable importance doesn’t meaningfully apply to k-Nearest Neighbors (kNN), so I will examine the random forest, elastic net, and neural network regression models. For the elastic net, I will also extract the ROI coefficients.

First, I’ll examine the original ROI conformation:

rf.var.imp <- varImp(models.for.ensemble.OG$ranger)[[1]] %>%
  rownames_to_column(var="Term") %>%
  dplyr::rename("Importance"="Overall") %>%
  mutate(Model="Random Forest")

glmnet.var.imp <- varImp(models.for.ensemble.OG$glmnet)[[1]] %>%
  rownames_to_column(var="Term") %>%
  dplyr::rename("Importance"="Overall") %>%
  mutate(Model="Elastic Net")

glmnet.var.coef <- as.data.frame(as.matrix(coef(models.for.ensemble.OG$glmnet$finalModel, models.for.ensemble.OG$glmnet$bestTune$lambda))) %>%
  rownames_to_column(var="Term") %>% dplyr::rename("Coefficient"="1") %>%
  mutate(Model="Elastic Net")

glmnet.var <- left_join(glmnet.var.imp, glmnet.var.coef)

nnet.var.imp <- varImp(models.for.ensemble.OG$nnet)[[1]] %>%
  rownames_to_column(var="Term") %>%
  dplyr::rename("Importance"="Overall") %>%
  mutate(Model="Neural Network")

og.importance <- do.call(plyr::rbind.fill, list(rf.var.imp, glmnet.var, nnet.var.imp))
datatable(og.importance)

my.pal <- colorRampPalette(c("#C0D3D9", "#7DBCDD", "#2D69A5", "#2A486C"))(200)
p.importance <- og.importance %>%
  ggplot(data=., mapping=aes(x=Term, y=Importance, fill=Importance)) +
  geom_bar(stat="identity") +
  facet_wrap(Model ~ ., scales="free_y", nrow=3) +
  theme_minimal() +
  scale_fill_gradientn(colors=my.pal) +
  theme(axis.text.x = element_text(angle=90, hjust=1),
        axis.title=element_blank(),
        strip.text = element_text(size=14))

ggplotly(p.importance, width=900, height=800) %>%
  layout(xaxis = list(title = "Model Term", 
                      titlefont = list(size = 14)), 
         yaxis=list(title="Importance", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Importance", 
                       titlefont = list(size = 12)),
         yaxis3 = list(title="Importance", 
                       titlefont = list(size = 12)))

It’s really interesting how differently these three models weight each model term. The elastic net assigned some of the highest importance to the bankssts, entorhinal, hippocampus, and inferiortemporal ROIs, which typically carry some of the heaviest tau pathology burdens in the human Alzheimer’s Disease brain. The neural network found sex to be the most important predictor by far, which is surprising and may contribute to the relatively low predictive accuracy of the neural network. The random forest identified the inferiorparietal, isthmuscingulate, insula, and parahippocampal ROIs as the most important predictor terms.

I’m curious as to how variable importance relates to the regularized coefficient calculated for each predictor term in the elastic net model:

my.pal <- colorRampPalette(c("#fcb9b2", "#f67e7d", "#843b62", "#621940"))(200)
p.enet <- og.importance %>%
  filter(Model=="Elastic Net") %>%
  ggplot(data=., mapping=aes(x=fct_reorder(Term, Coefficient), y=Coefficient, 
                             fill=Importance)) +
  geom_bar(stat="identity") +
  xlab("Term") +
  theme_minimal() +
  scale_fill_gradientn(colors=my.pal) +
  ggtitle("Variable Coefficients & Importance in Elastic Net Regression") +
  theme(axis.text.x=element_text(angle=90),
        plot.title=element_text(hjust=0.5))

ggplotly(p.enet)

The most important features bookend the x-axis and also have the largest-magnitude coefficients. It’s surprising that the entorhinal cortex and hippocampus have two of the most negative coefficients of all the variables; a negative coefficient means that the rate of tau accumulation in that ROI is negatively associated with CDR-Sum of Boxes increase. The entorhinal cortex and hippocampus are two of the first gray matter regions to develop tau pathology in Alzheimer’s Disease according to the Braak staging paradigm, so I would expect increased tau accumulation in these regions to be associated with increased CDR-Sum of Boxes scores, not decreased (note: a higher CDR-Sum of Boxes score means greater cognitive impairment). One possible interpretation relates to the temporal gap between tau neurofibrillary tangle accumulation and cognitive decline described above, though further assessment would be warranted to interpret the implications of these negative coefficients.

I like to use the ggseg package to visualize these elastic net regression coefficients and variable-importance metrics in brain space:

ggseg.aparc <- read_excel("../ggseg_roi.xlsx", sheet=1) %>%
  mutate(Braak=as.numeric(as.roman(Braak)))

ggseg.aseg <- read_excel("../ggseg_roi.xlsx", sheet=2) %>%
  mutate(Braak=as.numeric(as.roman(Braak)))

my.pal <- colorRampPalette(c("#fcb9b2", "#f67e7d", "#843b62", "#621940"))(200)

p.roi.imp <- og.importance %>%
  filter(Model=="Elastic Net") %>%
  filter(!(Term %in% c("Sex_Male", "Age_Baseline"))) %>%
  left_join(., ggseg.aparc, by=c("Term"="tau_ROI")) %>%
  rename("region"="ggseg_ROI") %>%
  filter(!is.na(region)) %>%
  ggseg(atlas="dk", mapping=aes(fill=Importance, label=region)) +
  ggtitle("Elastic Net ROI Importance") +
  annotate(geom="text", x=410, y=-100, label="Left") +
  annotate(geom="text", x=1290, y=-100, label="Right") +
  scale_fill_gradientn(colors=my.pal) +
  theme(axis.text=element_blank(),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"),
        plot.title=element_text(hjust=0.5, family="calibri"),
        text=element_text(family="calibri"))

ggplotly(p.roi.imp, dynamicTicks = T, tooltip=c("label", "fill"), width=900, height=300)

One interesting observation is that the ROIs with the largest relative importance sit on the lateral surface of the cortex rather than the medial surface, with the exception of the entorhinal and pericalcarine cortices.

The elastic net coefficient weights can also be viewed in the brain:

p.roi.coef <- og.importance %>%
  filter(Model=="Elastic Net") %>%
  filter(!(Term %in% c("Sex_Male", "Age_Baseline"))) %>%
  left_join(., ggseg.aparc, by=c("Term"="tau_ROI")) %>%
  rename("region"="ggseg_ROI") %>%
  filter(!is.na(region)) %>%
  ggseg(atlas="dk", mapping=aes(fill=Coefficient, label=region)) +
  ggtitle("Elastic Net ROI Coefficients") +
  annotate(geom="text", x=410, y=-100, label="Left") +
  annotate(geom="text", x=1290, y=-100, label="Right") +
  scale_fill_continuous_divergingx(palette = 'RdBu', rev=T, mid = 0) +
  theme(axis.text=element_blank(),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"),
        plot.title=element_text(hjust=0.5, family="calibri"),
        text=element_text(family="calibri"))

ggplotly(p.roi.coef, dynamicTicks = T, tooltip=c("label", "fill"), width=900, height=300)